titaiwangms (Contributor) commented Oct 31, 2025

This pull request introduces significant improvements and expanded support for multi-head attention kernels in ONNX Runtime, focusing on support for both 3D (BSNH) and 4D (BNSH) QKV input formats. The changes improve the flexibility, correctness, and maintainability of attention operations across the CPU and CUDA implementations.

Expanded QKV Input Format Support

  • Added support for the 4D QKV input format (Q_K_V_BNSH) in the CUDA attention kernels, covering the cases with and without past/present state and rejecting bias, which is not supported for this format. The new path avoids unnecessary transposes and writes outputs directly when possible; see the layout sketch below. [1] [2] [3] [4] [5] [6] [7]
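To make the two layouts concrete, here is a minimal sketch (plain C++, not ONNX Runtime code; the helper names are made up for illustration) of the index mapping for BSNH versus BNSH buffers and the transpose a BSNH input would otherwise need. A 4D BNSH input already matches the per-head layout the attention math consumes, which is what lets the kernel skip the copy.

```cpp
// Minimal sketch, not ORT code: flat offsets for the two QKV layouts and the
// BSNH -> BNSH transpose that a BSNH input would normally require.
#include <cstddef>

// Offset of element (b, s, n, h) in a BSNH buffer: [batch, seq, heads, head_size].
inline size_t bsnh_offset(size_t b, size_t s, size_t n, size_t h,
                          size_t S, size_t N, size_t H) {
  return ((b * S + s) * N + n) * H + h;
}

// Offset of the same element in a BNSH buffer: [batch, heads, seq, head_size].
inline size_t bnsh_offset(size_t b, size_t s, size_t n, size_t h,
                          size_t S, size_t N, size_t H) {
  return ((b * N + n) * S + s) * H + h;
}

// Copy a BSNH buffer into BNSH order; a 4D BNSH input can skip this copy.
void transpose_bsnh_to_bnsh(const float* src, float* dst,
                            size_t B, size_t S, size_t N, size_t H) {
  for (size_t b = 0; b < B; ++b)
    for (size_t s = 0; s < S; ++s)
      for (size_t n = 0; n < N; ++n)
        for (size_t h = 0; h < H; ++h)
          dst[bnsh_offset(b, s, n, h, S, N, H)] =
              src[bsnh_offset(b, s, n, h, S, N, H)];
}
```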

Kernel and Operator Documentation Updates

  • Updated OperatorKernels.md to document the new Attention operator inputs and outputs for both 3D and 4D formats, specifying supported tensor types for each input.

Correctness and Consistency Fixes

  • Fixed the computation of causal attention indices in the CUDA softmax kernels by correcting the offset used for causal masking; a worked example follows this list. [1] [2] [3] [4]
  • Updated workspace allocation logic for QKV preparation to ensure correct workspace usage for new formats.
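As a rough illustration of what the causal offset has to account for (a standalone sketch under assumed variable names, not the actual CUDA softmax kernel): with a past key/value cache, query row i of the current chunk sits at absolute position past_len + i, so a causal mask must admit keys 0 through past_len + i and no more. Getting this offset wrong either leaks future tokens or masks out valid past tokens.

```cpp
// Standalone sketch of causal masking with a KV cache (illustrative only).
#include <cstdio>

// Number of key positions query row `query_index` may attend to under a
// causal mask when `past_sequence_length` tokens are already cached.
inline int causal_valid_length(int query_index, int past_sequence_length) {
  return past_sequence_length + query_index + 1;  // keys [0, past + i]
}

int main() {
  const int past_len = 4;  // tokens already in the KV cache
  const int seq_len = 3;   // tokens in the current chunk
  const int total_len = past_len + seq_len;
  for (int i = 0; i < seq_len; ++i) {
    const int valid = causal_valid_length(i, past_len);
    std::printf("query %d attends to keys [0, %d); %d future key(s) masked\n",
                i, valid, total_len - valid);
  }
  return 0;
}
```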

Attention Parameter and Helper Refactoring

  • Added an is_output_bnsh field to AttentionParameters to indicate the output format, and updated the output placement and transposition logic to use it (see the sketch after this list). [1] [2]
  • Refactored CPU attention implementation to use the new attention_helper namespace for output mode enums and output shape computation, improving code clarity and maintainability. [1] [2] [3]
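A minimal sketch of how such a flag can drive the output path (hypothetical type and field names, not the actual AttentionParameters): the score-times-V product is produced per head, i.e. in BNSH order, so when the requested output layout is also BNSH the kernel can write into the output buffer directly and skip the final transpose and its scratch buffer.

```cpp
// Hypothetical parameter struct, for illustration only.
struct AttentionParams {
  int batch_size = 0;
  int sequence_length = 0;
  int num_heads = 0;
  int head_size = 0;
  bool is_output_bnsh = false;  // true: caller wants BNSH output, no final transpose
};

// The attention result is computed per head (BNSH); if that is also the
// requested output layout, it can be written to the output buffer directly.
inline bool can_write_output_directly(const AttentionParams& p) {
  return p.is_output_bnsh;
}
```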

Minor Cleanups

  • Removed outdated asserts and improved debug output strings for QKV preparation functions to clarify format and state handling. [1] [2] [3]

These changes collectively improve the flexibility, correctness, and maintainability of attention kernel implementations in ONNX Runtime, especially for advanced transformer models and large language model workloads.

NOT supported in this PR

  • Boolean mask
  • GQA
  • Softcap
  • Softmax precision
  • qk_output_mode other than -1 and 0

@titaiwangms added the ep:CUDA label (issues related to the CUDA execution provider) on Nov 19, 2025
@titaiwangms requested a review from @tianleiwu on January 12, 2026 17:53
@titaiwangms modified the milestone: 1.24.0 on Jan 13, 2026
@tianleiwu changed the title from Attenion(23) CUDA to Attention(23) CUDA on Jan 14, 2026
@tianleiwu previously approved these changes on Jan 14, 2026
@titaiwangms enabled auto-merge (squash) on January 14, 2026 21:25
@titaiwangms merged commit a3e477e into main on Jan 15, 2026
88 of 90 checks passed
@titaiwangms deleted the titaiwang/support_attention_cuda branch on January 15, 2026 02:57
titaiwangms added a commit that referenced this pull request Jan 15, 2026
titaiwangms added a commit that referenced this pull request Jan 15, 2026